
Assessment: Gemini Batch Fix & Dataset Preview Row Limiting#820

Open
vprashrex wants to merge 5 commits into main from chore/assessment-gemini-batch-fix

Conversation

@vprashrex
Collaborator

@vprashrex vprashrex commented May 9, 2026

Target issue: #830

Summary

This pull request addresses two key issues within the assessment module:

  1. Fixed bugs in Gemini batch processing to improve the reliability and stability of AI assessment execution and testing workflows.
  2. Added limit_rows support to the dataset preview endpoint, allowing clients to fetch only a limited number of dataset rows instead of the full dataset.
    Previously, large dataset responses caused frontend browser lag and UI freezes due to excessive data rendering on the client side. By introducing row limiting, the frontend can now request lightweight previews (for example, 5 rows), resulting in improved performance and a smoother user experience.
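
As a rough illustration of the intended behavior (the function and parameter names below are hypothetical, not taken from the PR's code), the row-limiting logic amounts to:

```python
def preview_rows(all_rows, limit_rows=None):
    """Sketch of the limit_rows behavior: when the client passes limit_rows
    (validated to the 1-100 range), only the first N rows are returned;
    when it is omitted, no preview is built at all."""
    if limit_rows is None:
        return None  # no preview requested; the dataset file is not fetched
    if not 1 <= limit_rows <= 100:
        raise ValueError("limit_rows must be between 1 and 100")
    return all_rows[:limit_rows]
```

In the real endpoint the range check would live in the FastAPI query parameter declaration; the sketch only shows the trimming semantics.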

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Please add any other information required for the reviewer here.

Summary by CodeRabbit

  • New Features

    • GET dataset endpoint can return a lightweight preview (column headers + first N rows) via optional limit_rows (1–100).
  • Documentation

    • API docs updated to describe the new limit_rows preview option and its behavior.
  • Refactor

    • Batch submission format now emits row identifiers at the top level for clearer row tracking.
  • Bug Fixes

    • Preview requests return appropriate HTTP error responses for missing/invalid or unsupported files.
  • Tests

    • Test coverage added/updated for dataset preview behavior and batch identifier location.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 9, 2026

📝 Walkthrough

Walkthrough

Adds optional dataset preview (headers + first N rows) to GET /datasets/{id} with models, service parsing CSV/XLSX, docs and tests; moves Gemini/Google JSONL row identifier to a top-level key (tests updated).

Changes

Assessment dataset preview

Layer / File(s): Summary

  • Preview Pydantic models (backend/app/models/assessment.py): adds AssessmentDatasetPreview and an optional preview field to AssessmentDatasetResponse.
  • Preview parsing service (backend/app/services/assessment/dataset.py): adds _stringify, _preview_csv, _preview_excel, and preview_dataset to fetch and parse CSV/XLSX previews and return headers + rows, with HTTP error handling.
  • API handler and wiring for preview (backend/app/api/routes/assessment/datasets.py): imports preview types/service, extends _dataset_to_response to accept preview, adds the limit_rows query param, builds AssessmentDatasetPreview when requested, and includes it in responses.
  • Endpoint docs (backend/app/api/docs/assessment/get_dataset.md): documents the limit_rows (1–100) parameter and notes that omitting it avoids fetching the underlying file.
  • Preview tests (backend/app/tests/assessment/test_dataset.py, backend/app/tests/assessment/test_routes.py): adds tests for CSV/XLSX preview outputs, encoding fallbacks, error cases, and route-level preview behavior.

Gemini Batch JSONL Schema

Layer / File(s): Summary

  • JSONL row identifier schema (backend/app/crud/assessment/batch.py, backend/app/tests/assessment/test_batch.py): build_google_jsonl now emits the row identifier as a top-level key instead of metadata.key; the test assertion is updated accordingly.
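
For illustration, the schema change can be sketched as follows; the record layout here is a hypothetical simplification of the Gemini batch JSONL format, not the exact production schema:

```python
import json

def build_google_jsonl(rows):
    """Sketch: emit each row's identifier as a top-level "key" (previously
    it was nested under a metadata object), so downstream code can track
    rows without digging into the request payload."""
    lines = []
    for row_id, prompt in rows:
        record = {
            "key": str(row_id),  # top-level row identifier
            "request": {
                "contents": [{"role": "user", "parts": [{"text": prompt}]}],
            },
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

A test can then assert on the identifier's location directly, mirroring the updated assertion in test_batch.py.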

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • kartpop
  • AkhileshNegi
  • Ayush8923

Poem

🐰 A key pops up where it’s easy to see,
Preview hops in with a header and three,
CSV and sheets, trimmed neat and sweet,
Rows and columns met on a tiny treat,
Hooray — the backend’s lighter on its feet!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 19.23%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Title check (✅ Passed): the title accurately summarizes the two main changes: a Gemini batch schema fix and dataset preview row limiting functionality.
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.



@vprashrex vprashrex requested a review from Prajna1999 May 11, 2026 05:34
@vprashrex vprashrex self-assigned this May 11, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/api/routes/assessment/datasets.py`:
- Around line 149-161: The truncated flag is over-reported because the code
treats len(rows) >= limit_rows as truncated; to fix, request one extra row from
preview_assessment_dataset (call with limit=limit_rows + 1), set truncated =
len(rows) > limit_rows, and if truncated trim rows to the original limit_rows
before constructing AssessmentDatasetPreview (use the existing names session,
dataset, limit_rows, preview_assessment_dataset, headers, rows, and
AssessmentDatasetPreview).
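
Concretely, the first fix above amounts to something like the following sketch, where the fetch_rows callable stands in for preview_assessment_dataset and all names are illustrative:

```python
def fetch_preview(fetch_rows, limit_rows):
    """Request one extra row so `truncated` is reported only when more data
    actually exists beyond the requested limit, then trim back down."""
    headers, rows = fetch_rows(limit_rows + 1)
    truncated = len(rows) > limit_rows
    if truncated:
        rows = rows[:limit_rows]
    return headers, rows, truncated
```

With the original len(rows) >= limit_rows test, a dataset of exactly limit_rows rows would be falsely flagged as truncated; the extra-row probe removes that false positive.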

In `@backend/app/services/assessment/dataset.py`:
- Around line 197-219: The current preview logic defaults to CSV for any
non-".xlsx" file_ext which can silently mis-handle missing/invalid metadata;
update the preview path in the preview function (where file_ext is derived) to
validate file_ext explicitly (normalize with .lower() and strip), and only allow
known extensions like ".xlsx" and ".csv"; if file_ext is None or not in the
allowed set, raise HTTPException(status_code=422, detail="Unsupported or missing
file extension.") instead of calling _preview_csv, otherwise call _preview_excel
for ".xlsx" and _preview_csv for ".csv".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 60a75072-f1b1-49b2-b789-0b3989427bea

📥 Commits

Reviewing files that changed from the base of the PR and between e08abbc and 15ad20d.

📒 Files selected for processing (5)
  • backend/app/api/docs/assessment/get_dataset.md
  • backend/app/api/routes/assessment/datasets.py
  • backend/app/models/assessment.py
  • backend/app/services/assessment/dataset.py
  • backend/app/tests/assessment/test_batch.py
✅ Files skipped from review due to trivial changes (1)
  • backend/app/api/docs/assessment/get_dataset.md

Comment thread backend/app/api/routes/assessment/datasets.py
Comment on lines +197 to +219
file_ext = (dataset.dataset_metadata or {}).get("file_extension")
if file_ext == ".xls":
    raise HTTPException(
        status_code=422,
        detail="Legacy Excel format (.xls) is not supported.",
    )

storage = get_cloud_storage(session=session, project_id=project_id)
try:
    content = storage.get(dataset.object_store_url)
except Exception as e:
    logger.warning(
        f"[preview_dataset] Failed to fetch file | dataset_id={dataset.id} | {e}",
        exc_info=True,
    )
    raise HTTPException(
        status_code=502, detail="Failed to fetch dataset file from storage."
    ) from e

try:
    if file_ext == ".xlsx":
        return _preview_excel(content, limit)
    return _preview_csv(content, limit)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject unknown/missing file extensions instead of defaulting to CSV.

Line 219 defaults to CSV parsing for non-.xlsx values. If metadata is missing/incorrect, preview can return garbage instead of a clear 422.

Suggested fix
-    file_ext = (dataset.dataset_metadata or {}).get("file_extension")
+    file_ext = ((dataset.dataset_metadata or {}).get("file_extension") or "").lower()
     if file_ext == ".xls":
         raise HTTPException(
             status_code=422,
             detail="Legacy Excel format (.xls) is not supported.",
         )
+    if file_ext not in {".csv", ".xlsx"}:
+        raise HTTPException(
+            status_code=422,
+            detail="Unsupported or missing dataset file extension for preview.",
+        )
...
-        if file_ext == ".xlsx":
+        if file_ext == ".xlsx":
             return _preview_excel(content, limit)
         return _preview_csv(content, limit)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/assessment/dataset.py` around lines 197 - 219, The
current preview logic defaults to CSV for any non-".xlsx" file_ext which can
silently mis-handle missing/invalid metadata; update the preview path in the
preview function (where file_ext is derived) to validate file_ext explicitly
(normalize with .lower() and strip), and only allow known extensions like
".xlsx" and ".csv"; if file_ext is None or not in the allowed set, raise
HTTPException(status_code=422, detail="Unsupported or missing file extension.")
instead of calling _preview_csv, otherwise call _preview_excel for ".xlsx" and
_preview_csv for ".csv".

@codecov

codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 97.77778% with 4 lines in your changes missing coverage. Please review.

Files with missing lines
  • backend/app/services/assessment/dataset.py: patch coverage 93.84%, 4 lines missing ⚠️

📢 Thoughts on this report? Let us know!

@vprashrex vprashrex changed the title from "Assessment (HotFix): Gemini Batch Fix" to "Assessment: Gemini Batch Fix" May 12, 2026

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
backend/app/tests/assessment/test_dataset.py (3)

148-163: 💤 Low value

Consider importing openpyxl at module level.

openpyxl is already imported at line 7 for InvalidFileException. Importing it again inside the test function (lines 149, 151) is inconsistent with the module-level import pattern.

♻️ Proposed consolidation

At the top of the file, consolidate the imports:

 from openpyxl.utils.exceptions import InvalidFileException
+import openpyxl
+import io

Then remove the inline imports in the test functions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/assessment/test_dataset.py` around lines 148 - 163, The
test function test_preview_excel_returns_headers_and_rows imports openpyxl
locally even though openpyxl is already imported at module level for
InvalidFileException; remove the inline imports inside
test_preview_excel_returns_headers_and_rows and any other tests, and add/ensure
a single module-level import for openpyxl alongside InvalidFileException so the
test uses that top-level import instead.

142-146: ⚡ Quick win

Strengthen the latin-1 fallback assertion.

The test claims to verify latin-1 fallback but only checks that the value starts with "ca". It should verify that the invalid UTF-8 byte \xff was correctly decoded as ÿ (U+00FF in latin-1) rather than dropped.

✨ Proposed stronger assertion
     def test_preview_csv_handles_latin1_fallback(self) -> None:
         # \xff is invalid utf-8 -> falls back to latin-1
         headers, rows = _preview_csv(b"name\nca\xfffe\n", limit=5)
         assert headers == ["name"]
-        assert rows and rows[0][0].startswith("ca")
+        assert rows and rows[0][0] == "caÿfe"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/assessment/test_dataset.py` around lines 142 - 146, Update
the test_preview_csv_handles_latin1_fallback assertion to verify the latin-1
decoded character is present: call _preview_csv as before, then assert that the
first cell exactly equals "caÿfe" or contains the Unicode character U+00FF (ÿ) to
ensure the invalid UTF-8 byte 0xFF was decoded via latin-1; refer to the test
function name test_preview_csv_handles_latin1_fallback and the helper
_preview_csv when locating the change.

165-175: ⚡ Quick win

Clarify expected behavior for empty workbooks.

The assertion at line 174 accepts two different outcomes ([""] or []), which suggests either:

  1. The expected behavior for empty workbooks is not well-defined, or
  2. The test is being overly permissive.

Consider determining the correct expected behavior and asserting only that outcome.

♻️ Proposed fix

If empty workbooks should return an empty list:

-        assert headers == [""] or headers == []
+        assert headers == []

Or if they should return a list with one empty string:

-        assert headers == [""] or headers == []
+        assert headers == [""]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/assessment/test_dataset.py` around lines 165 - 175, The
test test_preview_excel_empty_workbook is ambiguous because it accepts two
outcomes for headers; decide the canonical behavior for _preview_excel (either
return [] for no headers or [""] to represent a single empty header) and update
the test to assert that single expected value only; locate the test function
test_preview_excel_empty_workbook and the helper _preview_excel, then change the
assertion to assert headers == <chosen_expected_value> (and keep assert rows ==
[]), or if you choose to change _preview_excel instead, make it return the
chosen headers shape for an empty workbook and keep the test asserting that one
outcome.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@backend/app/tests/assessment/test_dataset.py`:
- Around line 148-163: The test function
test_preview_excel_returns_headers_and_rows imports openpyxl locally even though
openpyxl is already imported at module level for InvalidFileException; remove
the inline imports inside test_preview_excel_returns_headers_and_rows and any
other tests, and add/ensure a single module-level import for openpyxl alongside
InvalidFileException so the test uses that top-level import instead.
- Around line 142-146: Update the test_preview_csv_handles_latin1_fallback
assertion to verify the latin-1 decoded character is present: call _preview_csv
as before, then assert that the first cell exactly equals "caÿfe" or contains the
Unicode character U+00FF (ÿ) to ensure the invalid UTF-8 byte 0xFF was decoded
via latin-1; refer to the test function name
test_preview_csv_handles_latin1_fallback and the helper _preview_csv when
locating the change.
- Around line 165-175: The test test_preview_excel_empty_workbook is ambiguous
because it accepts two outcomes for headers; decide the canonical behavior for
_preview_excel (either return [] for no headers or [""] to represent a single
empty header) and update the test to assert that single expected value only;
locate the test function test_preview_excel_empty_workbook and the helper
_preview_excel, then change the assertion to assert headers ==
<chosen_expected_value> (and keep assert rows == []), or if you choose to change
_preview_excel instead, make it return the chosen headers shape for an empty
workbook and keep the test asserting that one outcome.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3ab2e4d5-c58f-4731-ada1-cfe46d7ba28d

📥 Commits

Reviewing files that changed from the base of the PR and between fa5e476 and 0a20a8b.

📒 Files selected for processing (2)
  • backend/app/tests/assessment/test_dataset.py
  • backend/app/tests/assessment/test_routes.py

@vprashrex vprashrex requested a review from Ayush8923 May 12, 2026 13:59
@vprashrex vprashrex changed the title from "Assessment: Gemini Batch Fix" to "Assessment: Gemini Batch Fix & Dataset Preview Row Limiting" May 12, 2026
